January 21, 2025
Let’s start with a general truth: the generalization error will always be greater than or equal to the training error (in expectation)
Proving this is true is a bit difficult.
But, we can build a proof by making an assumption beyond the i.i.d. assumption
Convenience Assumption - Fixed X
Let’s assume that we have two different i.i.d. data sets pulled from \mathcal T of size N, \{\mathbf X_D , \mathbf y_D\} and \{\mathbf X_D , \mathbf y_D'\}
The training features are the same
But the actual outcomes differ due to different noise draws
y = f(\mathbf x) + \epsilon \text{ ; } y' = f(\mathbf x) + \epsilon'
Last time, we showed that:
\text{Generalization Error} = \text{Training Error} + \frac{2}{N} \sum \limits_{i = 1}^N Cov(y_i , \hat{y}_i)
where the covariance term measures the dependence between the training outcomes and the predictions made at each \mathbf x_i from the training data.
For any good predictive model, we expect that these two things are positively correlated.
The higher the covariance of the predictions and the training inputs, the larger the gap between generalization error and training error
When might we expect the covariance between a prediction at \mathbf x and the observed outcome at that point to be high?
Note:
For all learning algorithms, the prediction at any point is a function of the training features, the training outcomes, and the feature vector at the prediction point.
\hat{y}_0 = g(\mathbf x_0, \mathbf X , \mathbf y)
Let’s explore this with another example.
Polynomial Regression with 1 feature:
y = \alpha + \sum \limits_{j = 1}^D \beta_j x^j
As the polynomial degree increases, the wiggliness of the function increases.
The feature matrix grows through basis expansion as the degree increases. For a third degree polynomial:
h(\mathbf x) = [1 , x , x^2, x^3]
Let’s assume that we estimated the coefficients using ordinary least squares on the training data:
\hat{\boldsymbol \beta} = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y
So, the prediction at any point \mathbf x_0 becomes:
\hat{y}_0 = \mathbf x_0^T \hat{\boldsymbol \beta} = \mathbf x_0^T (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y
Each prediction depends on the vector of training outcomes!
The training features and value at which the prediction is made create weights for the training outcomes
The prediction at any point is a weighted combination of the training outcomes.
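As a sanity check, here’s a small NumPy sketch (the cubic basis, signal, and noise level are all hypothetical choices) confirming that the OLS prediction at a point is exactly a weighted combination of the training outcomes:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 30
x = np.linspace(-1, 1, N)
X = np.vander(x, 4, increasing=True)           # basis expansion [1, x, x^2, x^3]
y = 2 + x - 0.5 * x**3 + rng.normal(0, 1, N)   # hypothetical training outcomes

x0 = np.array([1.0, 0.3, 0.3**2, 0.3**3])      # basis-expanded prediction point
w0 = x0 @ np.linalg.solve(X.T @ X, X.T)        # x0^T (X^T X)^{-1} X^T: one weight per y_i
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # usual OLS coefficients
print(np.isclose(w0 @ y, x0 @ beta_hat))       # weighted sum of outcomes = OLS prediction
```

The weights w0 depend only on the training features and the prediction point; the outcomes enter only through the weighted sum.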
We want to learn the covariance between the training outcomes and the predictions made for the training data
\sum Cov(\hat{ y}_i, y_i)
where
\hat{\mathbf y} = \mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y
So, we’re trying to compute
Cov[\mathbf y , \mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y]
Since covariance is a linear operator:
Cov[\mathbf y , \mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y] = \mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T Cov[\mathbf y,\mathbf y] = \sigma^2 \mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T
since Cov[\mathbf y , \mathbf y] = \sigma^2 \mathbf I under i.i.d. noise.
For any linear model, we can compute the sum of the self-covariances as:
\sigma^2 \text{tr}\left(\mathbf X(\mathbf X^T \mathbf X)^{-1} \mathbf X^T\right)
where \text{tr}() is the trace of the matrix (the sum of the diagonal elements)
The trace is invariant under cyclic permutations:
\sigma^2 \text{tr}\left(\mathbf X(\mathbf X^T \mathbf X)^{-1} \mathbf X^T\right) = \sigma^2 \text{tr}\left((\mathbf X^T \mathbf X)^{-1} \mathbf X^T\mathbf X\right) = \sigma^2 \text{tr}\left(\mathbf I_P\right)
The sum of the self-covariances is then:
P \sigma^2
and the training error offset term is:
\frac{2P\sigma^2}{N}
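We can verify this numerically with a fixed-\mathbf X simulation (the design, signal, and noise level below are illustrative choices): with P = 4 polynomial columns and \sigma^2 = 9, the summed covariances should come out near P\sigma^2 = 36.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma2 = 50, 9.0
x = np.linspace(-2, 2, N)
X = np.vander(x, 4, increasing=True)     # [1, x, x^2, x^3], so P = 4
H = X @ np.linalg.solve(X.T @ X, X.T)    # hat matrix X (X^T X)^{-1} X^T
f = 1 + 2 * x - x**3                     # hypothetical true signal

draws = 50_000                           # many outcome vectors with the SAME X (fixed-X assumption)
Y = f + rng.normal(0.0, np.sqrt(sigma2), size=(draws, N))
Yhat = Y @ H.T                           # OLS predictions for every draw

cov_sum = sum(np.cov(Y[:, i], Yhat[:, i])[0, 1] for i in range(N))
print(np.trace(H))                       # trace of the hat matrix = P = 4
print(cov_sum)                           # ≈ P * sigma2 = 36
```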
Let’s think about what this means:
As the number of implied features increases, the covariance between the training outcomes and the predictions increases
Put another way, the more implied features in the model, the more the prediction at \mathbf x depends on the value of the outcome at that point and the less it depends on the value of the outcome at surrounding points
For the polynomial model, as the degree increases, the number of implied features P grows.
This makes sense because we’re overresponding to wiggliness in the training data
Overfitting
This effect is diminished as N gets larger
Three factors that affect generalization performance
Training error
The covariance of training outcomes and predictions
The sample size
Lower the training error or the covariance, or increase the sample size, and the generalization error decreases!
(Figure: data generated from a 3rd degree polynomial with \sigma^2 = 9.)
This is great!
But, this result really only applies when we can compute the covariance of the prediction and the outcome
What is this covariance for an SVM with an RBF kernel?
What is this covariance for a random forest?
To make more headway here, we need to present more general results.
We’ll make the most progress exploring our two common loss functions.
Let’s start with MSE for a continuous outcome.
Generalization error under mean squared error:
E_\mathcal T[(y - \hat{y})^2]
In our previous example, we made a fixed \mathbf X assumption
Let’s not make that assumption.
Recall that \mathcal T is a multivariate distribution over the features and the outcomes. We can decompose \mathcal T into two components:
\mathcal T = P(\mathbf x) P(y | \mathbf x)
When we do not make the fixed \mathbf X assumption, there are two random components in the process assuming each draw is an i.i.d. draw from \mathcal T:
The training set - if we reached into \mathcal T and drew another training set of size N, we would expect a slightly different training set (both features and outcomes)
The outcomes conditional on the feature draws - even if we drew the same \mathbf x, we would expect the observed outcome to differ depending on the noise draw
These two sources of randomness mean that we can split our expectation into two pieces:
E_\mathcal T[(y - \hat{y})^2] = E_{y | x} \left[ E_D \left[ (y - \hat{y})^2 \right] \right]
D is our training data. Note that the prediction only depends on the training data drawn. With a different draw of D from \mathcal T, our prediction could be a little different. The prediction made at \mathbf x given the observed training set is \hat{y}.
y is random only in terms of the error variance. Given a value of \mathbf x, y = f(\mathbf x) + \epsilon.
We can evaluate these expectations to get a workable form:
E_\mathcal T[(y - \hat{y})^2] = V_{y | x}[y] + V_D[\hat{y}] + (E_D[\hat{y}] - f(\mathbf x))^2
where f(\mathbf x) is the true value of the signal evaluated at \mathbf x (Derivation included at the end of the slides)
Let’s look at each of these terms separately.
Bias
(E_D[\hat{y}] - f(\mathbf x))^2
Two ways that bias can change:
The flexibility of the chosen model with respect to the true functional form
Estimation bias for any unknowns in the chosen learning method (more on this later)
Let’s look back at our polynomial model with one feature. Generate a large number of i.i.d. training sets from the data generating distribution to see the different models we uncover with 1 polynomial term.
We see that there is a little variation over different i.i.d. training sets of size 100. Let’s look at the average prediction over 1000 i.i.d. training sets.
Repeated for a 2nd degree polynomial
Repeated for a 3rd degree polynomial
And a 10th degree polynomial
Since we know the true data generating function, let’s compute (or at least closely approximate) the bias for each polynomial degree between 1 and 15
When we underfit (polynomial degree lower than the true degree), the bias tends to be high
When we get the right model or overfit, the bias doesn’t decrease anymore!
Note that the bias is bounded above, attaining its maximum at the simplest possible model
Bias takeaways:
As the model becomes more complex, the bias decreases
Once the model is at least as flexible as the true function, the bias stops decreasing
Model Variance
V_D[\hat{y}]
Two ways that model variance can change:
The flexibility of the chosen model with respect to the true functional form
The size of the training sample
Generate a large number of i.i.d. training sets from the data generating distribution to see the different models we uncover with different polynomial degrees.
When the model is least complex, the model variance is its lowest
As the complexity increases, the model variance increases
Model variance tends to increase in an unbounded way as the complexity increases.
What happens as N changes?
As N gets larger, the model variance tends to uniformly decrease across complexity
Model variance takeaways:
As the model becomes more complex, the model variance increases
As we have a larger training set, the model variance decreases
Irreducible Error
V_{y | x}[y]
Just the cost of doing business:
We don’t have all of the features
We may not be able to uncover all of the functional variations that lead to the outcome
It’s just always there and cannot be reduced
This leads us to the well-known bias-variance decomposition:
\text{Generalization Error} = \text{Irreducible Error} + \text{Bias}^2 + \text{Model Variance}
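A Monte Carlo sketch of this decomposition (the cubic truth, noise level, test point, and underfit degree are all hypothetical choices): estimate bias², model variance, and the direct generalization error at a single point and check that they match.

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma = 50, 3.0
f = lambda x: 1 + 2 * x - x**3           # hypothetical true signal
x0 = 0.8
deg = 1                                  # deliberately underfit the cubic truth

preds = []
for _ in range(5000):                    # many i.i.d. training sets of size N
    x = rng.uniform(-2, 2, N)
    y = f(x) + rng.normal(0, sigma, N)
    preds.append(np.polyval(np.polyfit(x, y, deg), x0))
preds = np.array(preds)

bias2 = (preds.mean() - f(x0)) ** 2      # (E_D[yhat] - f(x0))^2
var = preds.var()                        # V_D[yhat]
decomp = sigma**2 + bias2 + var          # irreducible + bias^2 + model variance

y_new = f(x0) + rng.normal(0, sigma, preds.size)  # fresh outcomes at x0
mse = np.mean((y_new - preds) ** 2)               # direct estimate of E[(y - yhat)^2]
print(decomp, mse)                       # the two estimates should nearly agree
```

Because the model underfits, the bias² term dominates the model variance here.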
Let C be a value that represents the complexity of the chosen model
We can use our general observations about MSE generalization error to make a statement:
\text{Generalization Error} = \sigma^2 + \mathcal O\left(\frac{1}{C}\right) + \mathcal O \left(\frac{C}{N}\right)
As C increases, bias decreases
As C increases, model variance increases
However, with large N, the model variance increase is attenuated and we can tolerate more flexibility!
This leads us to a balance that must exist for good generalizable models:
If C is too high, the model variance overwhelms any reduction in bias (overfitting)
If C is too low, the bias overwhelms any reduction in model variance (underfitting)
This is referred to as the bias-variance tradeoff
There exists some optimal balance point in terms of complexity that is neither too complex nor too simple
The generalization error under 0/1 loss is:
E_\mathcal T[I(y \neq \hat{y})]
For any test instance, we’re either right or we’re wrong
We would like to know over the entirety of the data generating distribution what is the probability that any one instance is incorrect.
This expectation doesn’t admit any similarly clean decomposition
This matters because we can’t do the same thing that we did for MSE to show how the generalization error is determined by the bias and variance of the model.
Are there any guarantees about generalization error for the 0/1 loss function?
Let’s assume that we’re working with a two-class problem and our goal is to compute the expected 0/1 loss for the true data generating process, \mathcal T.
The Vapnik-Chervonenkis (VC) Bound
For a binary classification problem with a training set of size N, with probability 1 - \delta the following inequality on the 0/1 error holds:
\text{Generalization Error} \le \text{Training Error} + \sqrt{\frac{1}{N} \left[ d \left(\log \frac{2N}{d} + 1\right) - \log \frac{\delta}{4} \right]}
The proof to get this bound is gnarly
Relies on Hoeffding’s inequality, the concept of PAC learnability, and Sauer’s lemma
The interested student can find this proof in Chapter 6 of “Understanding Machine Learning: From Theory to Algorithms”
Note that N - the training size - and \delta - the probability of the guarantee - are fixed and known!
The new variable here is d
Formal definition:
For a learning method, \mathbf H, the VC dimension is defined as the size of the largest finite subset of an instance space, \Omega, shattered by \mathbf H
A hypothesis class is said to shatter a set of d points if, for every one of the 2^d possible labelings of those points, some classifier in the class produces decision regions that classify all d points correctly.
This doesn’t make a lot of sense on its own.
So, let’s draw some pictures!
Linear classifier in 2D feature space with 2,3, and 4 points
Quadratic classifier in 2D space with 4 points
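To see shattering concretely in the linear 2D case, here’s a brute-force sketch (the specific point sets are illustrative). It uses the fact that the perceptron converges exactly when a labeling is linearly separable:

```python
import numpy as np
from itertools import product

def separable(points, labels, epochs=500):
    """Perceptron test: a clean full pass occurs iff the labeling is linearly separable."""
    X = np.hstack([points, np.ones((len(points), 1))])  # append intercept feature
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        clean = True
        for xi, yi in zip(X, labels):
            if yi * (w @ xi) <= 0:      # misclassified (or on the boundary): update
                w += yi * xi
                clean = False
        if clean:
            return True
    return False

def count_separable(points):
    n = len(points)
    return sum(separable(points, lab) for lab in product((-1, 1), repeat=n))

triangle = np.array([[0., 0.], [1., 0.], [0., 1.]])            # 3 non-collinear points
square = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])    # 4 points in convex position
print(count_separable(triangle))  # 8 = 2^3: the 3 points are shattered
print(count_separable(square))    # 14 < 2^4: the two XOR labelings are not separable
```

So a linear classifier in 2D (P = 2) shatters 3 = P + 1 points but not 4, matching the VC dimension rule below.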
The VC dimension is a statistically sound way of measuring the complexity of a classifier
Some methods have a finite VC dimension:
Any linear classifier using P features with a free intercept has a VC dimension of P + 1
Gradient boosted linear classifiers (with T parts of the ensemble) have a VC dimension of T(P + 1)(2 + 3 \log T(P + 1))
Other algorithms have an infinite VC dimension:
1NN classifiers (are we guaranteed that the trained classifier ever actually generalizes?)
RBF SVMs
Could have infinitely bad generalization performance
But in practice it doesn’t, because the VC bound is a worst-case bound
Complicated point, happy to discuss more in office hours
What happens to the training offset as d and N change?
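We can tabulate the offset term from the VC bound directly (δ = 0.05 is an arbitrary choice):

```python
import numpy as np

def vc_offset(N, d, delta=0.05):
    """sqrt((d (log(2N/d) + 1) - log(delta/4)) / N): the training-error offset in the VC bound."""
    return np.sqrt((d * (np.log(2 * N / d) + 1) - np.log(delta / 4)) / N)

for N in (100, 1000, 10000):
    print(N, [round(vc_offset(N, d), 3) for d in (2, 5, 10, 25)])
```

For fixed N the offset grows with d; for fixed d it shrinks as N grows.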
Some observations:
This bound is not sharp - the 0/1 error can’t be greater than 1. The VC bound shouldn’t really be expected to give you a good measure of the actual value of the generalization error.
As the classifier becomes more expressive, the bound on the generalization error increases for fixed N.
This increase is offset as N increases.
Consider a family of functional forms that can range from a small finite VC dimensionality to an infinite VC dimensionality
Example: Polynomial Logistic Regression
2 features:
First order - [x_1,x_2]
Second order - [x_1,x_2,x_1^2,x_2^2,x_1x_2]
Third order - [x_1,x_2,x_1^2,x_2^2,x_1x_2,x_1^3,x_2^3,x_1x_2^2,x_1^2x_2]
We expect the training error to decrease as the VC dimension/polynomial order of the classifier increases
In theory, with enough polynomial terms, there is no training set that can’t be matched perfectly!
However, at a certain point, there exists some VC dimensionality where adding more complexity does not meaningfully decrease the training error anymore
But, we do continue to increase the training offset!
Learning algorithms that can range from simple to arbitrarily complex (indexed by tuning parameters) and guarantee 0 training error with enough complexity are called universal approximators
This means that we can re-express the VC bound in terms of complexity, C:
\text{Generalization Error} \approx \mathcal O \left( \frac{1}{C} \right) + \mathcal O \left( \frac{C}{N} \right)
which looks a lot like the bias/variance tradeoff!
Takeaway:
Even for classification, we need to worry about overfitting
Overall takeaways:
The optimal model in terms of generalization occurs when complexity of the learner is balanced - not too simple, not too complex, just right
Overfitting occurs when added complexity increases model variance by more than it reduces model bias, hurting generalization
We can tolerate more complexity as N gets larger. For truly complex problems (hint hint deep learning), you’ll need a lot of training data
All of this is theoretical - to give you an understanding of when we can tolerate a little more complexity
However, none of these tools really give us a general way to estimate the generalization error
Why do we want to measure generalization error for a specific model?
The theoretical tool:
We can approximate the expectation over the true data generating distribution if we have a true heldout test set
As the size of the heldout data gets large, the average error on the test set converges to the generalization error!
LLN at work
The important rule:
The test set cannot be used to train the model in any sense to adequately approximate the generalization error
The moment we use data to train the model, the predictions become correlated with the features and outcomes
This drags the estimate toward the training error, which we know understates the generalization error
We also need an “out-of-sample” data set that can be used to choose the appropriate value for any tuning parameters
This data set is used to assess something like the generalization error for different values of a tuning parameter (like K in KNN) and choose the value that is likely to give us the lowest generalization error!
But, once we use data to make this choice, we can no longer use it to adequately estimate the true generalization error!
The gold standard:
Start with all of your data and split it into three pieces
A training set used to estimate model coefficients (like regression coefficients)
A validation set used to choose tuning parameters (like the regularization factor in LASSO or K in KNN)
A test set used to estimate the true expected loss for a new prediction made with the model that never touches the training procedure
An issue:
Each of the quantities we want to estimate using each of the three pieces will be more accurate the larger each split set is!
Bigger N means more accurate coefficients
Bigger N means a better estimate of the generalization error for choosing tuning parameters
Bigger N means a better estimate of the true generalization error
If the original data set isn’t really large, doing a three part split leads to poor estimates throughout!
A rule of thumb:
If your training data is really large, just create a three part split and use the same validation set each time to choose the values of any tuning parameters
If your training data is smaller, use K-fold cross validation methods to choose the values of any tuning parameters
Regardless of choice, you always want to set aside an adequately large true heldout test set to assess the generalization error for any final chosen model!
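A minimal sketch of the three-part split (the 60/20/20 proportions are just one common choice):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
idx = rng.permutation(n)                 # shuffle before splitting

train = idx[:int(0.6 * n)]               # estimate coefficients here
val = idx[int(0.6 * n):int(0.8 * n)]     # choose tuning parameters here
test = idx[int(0.8 * n):]                # touch ONLY for the final generalization estimate

assert set(train).isdisjoint(val) and set(val).isdisjoint(test) and set(train).isdisjoint(test)
```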
You already know all of this stuff, so we’re not going to burn too much time on this point!
ML is all about generalization performance
We’re going to get more applied and discuss regression methods
Linear and Logistic Regression
Log Likelihoods and Equivalence to MSE/Probability Loss
Regularization and Penalized Methods for Better Generalization
These two sources of orthogonal randomness mean that we can split our expectation into two pieces:
E_\mathcal T[(y - \hat{y})^2] = E_{y | x} \left[ E_D \left[ (y - \hat{y})^2 \right] \right]
D is our training data. Note that the prediction only depends on the training data drawn. With a different draw of D from \mathcal T, our prediction could be a little different. The prediction made at \mathbf x given the observed training set is \hat{y}.
y is random only in terms of the error variance. Given a value of \mathbf x, y = f(\mathbf x) + \epsilon.
A neat random variable rule. Suppose a is a random variable and b is a constant, then:
E_a[(a - b)^2] = V_a[a] + (E_a[a] - b)^2
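This identity is exact for sample moments too, which makes it easy to sanity check (the distribution and constant below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
a = rng.normal(2.0, 1.5, 100_000)         # draws of the random variable a
b = 0.7                                   # the constant

lhs = np.mean((a - b) ** 2)               # E[(a - b)^2]
rhs = np.var(a) + (np.mean(a) - b) ** 2   # V[a] + (E[a] - b)^2
print(np.isclose(lhs, rhs))               # holds exactly, up to float error
```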
Let’s start by applying the expectation over the training data:
E_\mathcal T[(y - \hat{y})^2] = E_{y | x} \left[ V_D[\hat{y}] + (E_D[\hat{y}] - y)^2 \right]
Note that \hat{y} is not random with respect to the true value of y - the training draw is independent of the new noise draw.
Now applying the other expectation:
E_\mathcal T[(y - \hat{y})^2] = V_D[\hat{y}] + V_{y | x}[y] + (E_D[\hat{y}] - E_{y | x}[y])^2
And finally assuming that the expected value of the noise draw is zero:
E_\mathcal T[(y - \hat{y})^2] = V_D[\hat{y}] + V_{y | x}[y] + (E_D[\hat{y}] - f(\mathbf x))^2
where f(\mathbf x) is the true function that maps the features to the outcome (e.g. the signal).